
Adding new ARC job runner #16653

Draft: wants to merge 23 commits into base: dev

Conversation

@maikenp (Contributor) commented Sep 6, 2023

This feature is a prototype created in the context of ESG.

A new job runner for submitting jobs to a remote ARC middleware endpoint. Behind the ARC endpoint is a regular batch system (Slurm, HTCondor, etc.). ARC endpoints typically sit in front of designated grid infrastructures or HPC systems shared with other users.

The use-case is a user who has access and compute time on one or several HPC systems served by ARC. Instead of the traditional command-line approach to submitting jobs, Galaxy can be used. Authentication is handled by ARC, and the uploading of remote input datasets is also handled by ARC. The user authenticates with ARC in the regular way, via tokens issued by one of the endpoint's supported OIDC token providers. For this to be convenient, the user should log into Galaxy using that same OIDC token provider. Currently a WLCG IAM backend, which is one of the supported IdPs, is in a PR to social_core (python-social-auth/social-core#820). The Galaxy server must be configured to offer this OIDC provider as a login option via oidc_backends_config.yml. See details here: #16617

[screenshot]

Currently, users must configure their user preferences with a single ARC endpoint to use. There is a feature request for the possibility to add as many endpoints as wanted: #16596
The form allowing users to add an ARC endpoint is currently merged into the usegalaxy-eu infrastructure-playbook: usegalaxy-eu/infrastructure-playbook#879. If you want to try this out, have a look at it and apply the small additions to the relevant files.

[screenshot]

Currently, the runner also only works with a specific tool designated for ARC jobs. It has not yet been made ready for use with toolbox tools. The ARC tool config file looks like this:

<tool id="hello_arc" name="ARC hello">
  <description>is a simple hello world python script</description>
  <inputs>
    <param name="message" type="text" value="Hi from galaxy" label="Hello message"/>
    <param name="arcjob_exe" type="data" label="ARC executable"/>
    <param name="arcjob_cpuhrs" type="text" value="1" label="Job CPU hours"/>
    <param name="arcjob_memory" type="text" value="100" label="Job memory"/>
  </inputs>

  <outputs>
    <collection name="output" type="list" label="ARC outputs">
      <discover_datasets pattern="__designation__" directory="./" recurse="true" />
    </collection>
  </outputs>

  <help>
    TODO
  </help>
</tool>

[screenshot]

The tool expects you to upload an executable - in this case, a simple bash script with the following contents:

#!/bin/bash

/bin/echo "Hello from runscript to arcout1.txt" > ./arcout1.txt
/bin/echo "Hello from runscript to arcout2.txt" > ./arcout2.txt

/bin/echo "Running on compute node: $(hostname)" >> ./arcout1.txt
/bin/echo "Time is $(date)" >> ./arcout1.txt

/bin/echo "Running on compute node: $(hostname)" >> ./arcout2.txt
/bin/echo "Time is $(date)" >> ./arcout2.txt

/bin/echo "Sleeping for 60s " >> ./arcout1.txt
/bin/echo "Sleeping for 60s " >> ./arcout2.txt

/bin/echo "Some output to stdout from the runhello.sh executable."
/bin/sleep 60

But the executable could very well be scientific code in C++ or anything else, as long as the ARC endpoint supports that software (this will be handled properly later with so-called ARC runtime environment settings and matchmaking). Currently, this tool allows upload of a single executable and no other input data, just for testing purposes. In the future the job runner will support uploading an arbitrary number of input files from Galaxy, in addition to remote input files (ARC collects these from the remote source).

Once the job is done, all the output files created by the job's executable and by ARC itself are uploaded into Galaxy as expected.

How to test the changes?

License

  • I agree to license these and all my past contributions to the core galaxy codebase under the MIT license.

@github-actions github-actions bot added this to the 23.2 milestone Sep 6, 2023
@maikenp maikenp marked this pull request as draft September 6, 2023 16:28
Require version 0.2 which is compatible with the ARC job runner.
@jmchilton (Member)

I have a couple of refactorings I consider relatively important - can you pull them out of https://github.com/jmchilton/galaxy/tree/arc_job_runner?

This brings it in line with other job runners, makes it more isolated for potential unit testing, simplifies some big overloaded methods in the runner, and would allow a lot of this logic to be reused with a potential Pulsar job runner. I think Pulsar is the way you want to go long term, but I am happy to allow a Galaxy job runner for this as long as you don't get into command-rewriting logic - which there isn't here; you're staging paths as is. That staging could be more targeted with Pulsar, but I'm fine with being wary of premature optimization.

I haven't tested my changes - I think the refactorings are vaguely correct, but I probably introduced some bugs. Hopefully any regressions will be easy to track down, but if I broke something and you want advice, I'm happy to work with you on that.

My next refactoring would be this... almost global ARCJob thing? You're reusing the same object for every job and every request? Shouldn't this somehow be constructed per job? Is there an issue with that? I'd be very anxious about threading and cleanup and such, as things are now.

Also, the job object has a job_runner_external_id attribute that should prevent the need to maintain a huge dictionary that never gets cleaned up (my understanding of arc_jobid = self.arc.job_mapping[job_state.job_id]). Can you use that, or is there something I am missing?

I think addressing these three points would make it easier for me to grok what is going on with the job runner and provide additional feedback.
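
To illustrate the per-job construction suggested above - a minimal, hypothetical sketch (the real ARCJob fields are not shown in this thread, so the dataclass below is illustrative only):

from dataclasses import dataclass, field
from typing import List


# Hypothetical stand-in for pyarcrest's ARCJob; fields are illustrative.
@dataclass
class ARCJobSketch:
    description: str = ""
    input_paths: List[str] = field(default_factory=list)


def build_arc_job(galaxy_job_id: str, input_paths: List[str]) -> ARCJobSketch:
    # One fresh instance per Galaxy job: no shared mutable state between
    # the submission thread and the monitor thread.
    return ARCJobSketch(
        description=f"galaxy job {galaxy_job_id}",
        input_paths=list(input_paths),
    )


job_a = build_arc_job("1", ["input1.txt"])
job_b = build_arc_job("2", ["input2.txt"])
assert job_a is not job_b  # independent objects, safe to mutate per job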



def job_actions(self, arcid, action):

Member:

Also - can you just make this two separate methods with two separate returns? This sort of method dispatching weakens our toolchain's ability to do static checking and, I think, makes the code a bit harder to read.

Contributor Author:

Thanks, will look at it!

log.error(f'Could not get status of job with id: {arcid}')
elif action == "kill":
results = self.arcrest.killJobs([arcid])
for killed in results:
Member:

This will return after the first iteration, so really there's no iteration - is that intentional?
Also, you could just return bool(killed) instead of the if/then.
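
A minimal sketch of the suggested split - two separate methods with two separate return types - assuming a hypothetical pyarcrest-style client (killJobs appears in the diff above; getJobsStatus and the stub class are assumptions for illustration):

from typing import List, Optional


class ArcRestStub:
    # Stand-in for the real pyarcrest client, for illustration only.
    def getJobsStatus(self, arcids: List[str]) -> List[Optional[str]]:
        return ["Running" for _ in arcids]

    def killJobs(self, arcids: List[str]) -> List[bool]:
        return [True for _ in arcids]


class ArcActionsSketch:
    def __init__(self, arcrest: ArcRestStub):
        self.arcrest = arcrest

    def job_status(self, arcid: str) -> Optional[str]:
        # Dedicated method with a single, statically checkable return type.
        statuses = self.arcrest.getJobsStatus([arcid])
        return statuses[0] if statuses else None

    def kill_job(self, arcid: str) -> bool:
        # No loop needed for a single job: inspect the first result and
        # return bool(...) directly instead of an if/then.
        results = self.arcrest.killJobs([arcid])
        return bool(results and results[0])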

Comment on lines 91 to 92
self._init_monitor_thread()
self._init_worker_threads()
Member:

These methods are called automatically on Galaxy startup; the logic goes like this: app.py > JobManager().start() > JobHandler().start() > DefaultJobDispatcher().start() > for each job runner, runner.start(), which is where both methods are called. Unless this is a special case (which I missed)?

Contributor Author:

I did not review/check whether this is necessary; I just followed https://docs.galaxyproject.org/en/master/dev/build_a_job_runner.html
Maybe it is outdated?

Member:

Yeah, that part of the documentation can be interpreted both ways (i.e., this is what the runner does vs. this is the code that should be added). I'd leave it out - it should work: all runners start those threads, which is why it's handled in the base runner.
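
A simplified sketch of that startup chain (hypothetical class names, not the actual Galaxy source) - the base runner's start() already launches both thread pools, so a subclass need not call them again:

class BaseJobRunnerSketch:
    def start(self) -> None:
        # Reached via JobManager().start() > JobHandler().start() >
        # DefaultJobDispatcher().start() > runner.start() for each runner.
        self._init_monitor_thread()
        self._init_worker_threads()

    def _init_monitor_thread(self) -> None:
        print("monitor thread started")

    def _init_worker_threads(self) -> None:
        print("worker threads started")


class ARCJobRunnerSketch(BaseJobRunnerSketch):
    # No explicit _init_monitor_thread()/_init_worker_threads() calls here;
    # the inherited start() handles both.
    pass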

@jdavcs (Member) commented Sep 7, 2023

To get rid of the linting errors, you can just run make format and tox -e mypy before pushing.

@maikenp (Contributor Author) commented Sep 7, 2023

Thanks for looking closely at this, spotting issues, and suggesting improvements. I will implement the changes.

@maikenp (Contributor Author) commented Sep 8, 2023

(quoting @jmchilton's review comment above in full)

Thanks a lot for these suggestions; they improve the code a lot. I have applied the changes and will test now.

About the ARCJob - I might not need it anymore, in fact. It was needed in an earlier version of pyarcrest, but we have since done some refactoring and I think it is of no use anymore, so I will look into that now. And yes, your concern about the per-job ARCJob issue is correct: I noticed the other day that what I had implemented there seemed wrong and needed another look.

…cessary job_actions method. Further improvements to the action on jobs will come.
@maikenp (Contributor Author) commented Sep 8, 2023

Also the job object has a job_runner_external_id attribute that should prevent the need to maintain a huge dictionary that never gets cleaned up (my understanding of arc_jobid = self.arc.job_mapping[job_state.job_id]). Can you use that or is there something I am missing.

@jmchilton

I am rewriting to be able to use the job_runner_external_id. However, in check_watched_item I seem to have difficulties getting the id.

Is
job_state.job_wrapper.get_job().get_job_runner_external_id() supposed to have access to this id?

It returns None for me.

Even though in queue_job I do:

job_wrapper.get_job().job_runner_external_id = arc_jobid
and have verified that at this point
job_wrapper.get_job().get_job_runner_external_id() indeed returns the correct ARC id.

But in check_watched_item, get_job_runner_external_id() returns None.
I also tried just job_runner_external_id, but same result.

And by the way: since there is a getter to get the id, it would be nice to have a setter to set it. But I see the setter is an attribute of the Task object, not the Job object. Any opinions on this?
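
For illustration, one common cause of this symptom with ORM-backed attributes (an assumption here, since the actual cause isn't recorded in this thread) is that the new value is never flushed to the database, so a different session - for example the monitor thread's - still reads the old value. A generic SQLAlchemy sketch, not Galaxy code:

from sqlalchemy import Column, Integer, String, create_engine
from sqlalchemy.orm import Session, declarative_base

Base = declarative_base()


class JobRow(Base):
    __tablename__ = "job"
    id = Column(Integer, primary_key=True)
    job_runner_external_id = Column(String, nullable=True)


engine = create_engine("sqlite://")  # in-memory database
Base.metadata.create_all(engine)

with Session(engine) as setup:
    setup.add(JobRow(id=1))
    setup.commit()

writer = Session(engine)  # e.g. the session used in queue_job
job = writer.get(JobRow, 1)
job.job_runner_external_id = "arc-123"  # set in memory, not yet flushed
print(job.job_runner_external_id)  # arc-123 within this session

reader = Session(engine)  # e.g. the monitor thread's session
print(reader.get(JobRow, 1).job_runner_external_id)  # None: not persisted yet

writer.commit()  # once committed, fresh reads see the new value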

@maikenp (Contributor Author) commented Sep 11, 2023

(quoting the exchange above about job_runner_external_id)

Sorted this out.

@maikenp (Contributor Author) commented Sep 13, 2023

About testing: finally (sorry for the delay and multiple tries) I fixed the whitespace linting errors and other issues, actively using

tox -e lint
tox -e mypy

Now I am trying to sort out the

tox -e check_indexes

error I see here, but running locally all is OK.

<snip>
The Galaxy client build is being skipped due to the SKIP_CLIENT_BUILD environment variable.
check_indexes: commands[1]> bash manage_db.sh init
Activating virtualenv at .venv
INFO:galaxy.model.migrations:Creating database for URI [sqlite:///./database/universe.sqlite?isolation_level=IMMEDIATE]
INFO:alembic.runtime.migration:Context impl SQLiteImpl.
INFO:alembic.runtime.migration:Will assume non-transactional DDL.
INFO:alembic.runtime.migration:Running stamp_revision  -> 987ce9839ecb
DEBUG:alembic.runtime.migration:new branch insert 987ce9839ecb
INFO:alembic.runtime.migration:Context impl SQLiteImpl.
INFO:alembic.runtime.migration:Will assume non-transactional DDL.
INFO:alembic.runtime.migration:Running stamp_revision  -> d4a650f47a3c
DEBUG:alembic.runtime.migration:new branch insert d4a650f47a3c
check_indexes: commands[2]> bash check_model.sh
Activating virtualenv at .venv
/storage/gitrepos/forks/galaxy/./scripts/check_model.py:50: RemovedIn20Warning: Deprecated API features detected! These feature(s) are not compatible with SQLAlchemy 2.0. To prevent incompatible upgrades prior to updating applications, ensure requirements files are pinned to "sqlalchemy<2.0". Set environment variable SQLALCHEMY_WARN_20=1 to show all deprecation warnings.  Set environment variable SQLALCHEMY_SILENCE_UBER_WARNING=1 to silence this message. (Background on SQLAlchemy 2.0 at: https://sqlalche.me/e/b8d9)
  metadata = MetaData(bind=create_engine(db_url))
  check_indexes: OK (259.65=setup[0.14]+cmd[242.92,13.78,2.80] seconds)
  congratulations :) (259.71 seconds)

But in the repo tests I see:

Successfully installed httplib2-0.22.0 oauth2client-4.1.3 psycopg2-binary-2.9.5 pyasn1-modules-0.3.0 pykube-0.15.0
Installing node into /home/runner/work/galaxy/galaxy/galaxy root/.venv with nodeenv.
 * Install prebuilt node (18.12.1) .Failed to download from https://nodejs.org/download/release/v18.12.1/node-v18.12.1-linux-x64.tar.gz
..
Traceback (most recent call last):
  File "/home/runner/work/galaxy/galaxy/galaxy root/.venv/bin/nodeenv", line 10, in <module>
    sys.exit(main())
  File "/home/runner/work/galaxy/galaxy/galaxy root/.venv/lib/python3.7/site-packages/nodeenv.py", line 1122, in main
    create_environment(env_dir, args)
  File "/home/runner/work/galaxy/galaxy/galaxy root/.venv/lib/python3.7/site-packages/nodeenv.py", line 998, in create_environment
    install_node(env_dir, src_dir, args)
  File "/home/runner/work/galaxy/galaxy/galaxy root/.venv/lib/python3.7/site-packages/nodeenv.py", line 755, in install_node
    install_node_wrapped(env_dir, src_dir, args)
  File "/home/runner/work/galaxy/galaxy/galaxy root/.venv/lib/python3.7/site-packages/nodeenv.py", line 790, in install_node_wrapped
    copy_node_from_prebuilt(env_dir, src_dir, args.node)
  File "/home/runner/work/galaxy/galaxy/galaxy root/.venv/lib/python3.7/site-packages/nodeenv.py", line 681, in copy_node_from_prebuilt
    src_folder, = glob.glob(src_folder_tpl)
ValueError: not enough values to unpack (expected 1, got 0)
check_indexes: exit 1 (157.23 seconds) /home/runner/work/galaxy/galaxy/galaxy root> bash scripts/common_startup.sh pid=2159
  check_indexes: FAIL code 1 (159.03=setup[1.79]+cmd[157.23] seconds)
  evaluation failed :( (159.43 seconds)
Error: Process completed with exit code 1.

What can I do to fix this error?

@jdavcs (Member) commented Sep 13, 2023

I've restarted those tests - they've just passed, so the previous error was unrelated.

@maikenp maikenp marked this pull request as ready for review September 13, 2023 20:33
@maikenp (Contributor Author) commented Sep 14, 2023

The failed jobs - I'm not sure how I should fix this. Is it related or unrelated to my changes?

@maikenp maikenp marked this pull request as draft September 15, 2023 08:27
@maikenp maikenp mentioned this pull request Sep 27, 2023
4 tasks
@mvdbeek mvdbeek modified the milestones: 23.2, 24.0 Dec 19, 2023